Abstract: Internet worm infection continues to be one of the top security threats and has been widely used by botnets to recruit new bots. In this work, we attempt to quantify the infection ability of individual hosts and reveal the key characteristics of the underlying topology formed by worm infection, i.e., the number of children and the generation of the worm infection family tree. Specifically, we first apply probabilistic modeling methods and a sequential growth model to analyze the infection tree of a wide class of worms. We find analytically and empirically that the number of children asymptotically has a geometric distribution with parameter 0.5. As a result, on average half of infected hosts never compromise any vulnerable host, over 98% of infected hosts have no more than five children, and a small portion of infected hosts have a large number of children. We also discover that the generation closely follows a Poisson distribution and that the average path length of the worm infection family tree increases approximately logarithmically with the total number of infected hosts. Next, we empirically study the infection structure of localized-scanning worms and, surprisingly, find that most of the above observations also apply to them. Finally, we apply our findings to develop bot detection methods and study potential countermeasures for a botnet (e.g., Conficker C) that uses scan-based peer discovery to form a P2P-based botnet. Specifically, we demonstrate that targeted detection, which focuses on the nodes with the largest number of children, is an efficient way to expose bots. For example, our simulation shows that when 3.125% of nodes are examined, targeted detection can reveal 22.36% of bots. However, we also point out that future botnets may limit the maximum number of children to weaken targeted detection without greatly slowing down the speed of worm infection.

Index Terms: Worm infection family tree, botnet, probabilistic modeling, simulation, topology, detection.



1 Introduction 

Internet epidemics are malicious software that can self-propagate across the Internet, i.e., compromise vulnerable hosts and use them to attack other victims. Internet epidemics include viruses, worms, and bots, and they have evolved over more than twenty years. Viruses, which infect machines through exchanged emails or disks, dominated the 1980s and 1990s. Internet active worms compromise vulnerable hosts by automatically propagating through the Internet and have attracted much attention since the Code Red and Nimda worms in 2001. Botnets are zombie networks controlled by attackers through Internet relay chat (IRC) systems (e.g., GT Bot) or peer-to-peer (P2P) systems (e.g., Storm) to execute coordinated attacks, and they have become the number one threat to the Internet in recent years. Since Internet epidemics have evolved to become more and more virulent and stealthy, they have been identified as one of the top security problems [1].

The main difference between worms and botnets is that worms emphasize the procedures of infecting targets and propagating among vulnerable hosts, whereas botnets focus on the mechanisms of organizing the network of compromised computers and launching coordinated attacks. Most botnets, however, still apply worm-scanning methods to recruit new bots or collect network information [2], [3], [4], [5]. Moreover, although many P2P-based botnets use existing P2P networks to build a bootstrap procedure, Conficker C forms a P2P botnet through scan-based peer discovery [6], [7]. Specifically, Conficker C searches for new peers by randomly scanning the entire Internet address space. As a result, the way that Conficker C constructs a P2P-based botnet is in principle the same as worm scanning/infection. Therefore, characterizing the structure of worm infection is important and imperative for defending against current and future epidemics such as Internet worms and Conficker C-like P2P-based botnets.

Modeling of Internet worm infection has focused on the macro level. Most, if not all, mathematical models study the total number of infected hosts over time [8], [9], [10], [11], [12], [2]. Micro-level models of worm infection, however, have received little attention. Such models can provide more insight into the infection ability of individual compromised hosts and the underlying topologies formed by worm infection. A key piece of micro-level information is "who infects whom", i.e., the worm infection family tree. When a host infects another host, they form a "father-and-son" relationship, which is represented by a directed edge in a graph formed by worm infection. Hence, the procedure of worm propagation constructs a directed tree where patient zero is the root and the infected hosts that do not compromise any vulnerable host are leaves (see Fig. 1). To the best of our knowledge, there is as yet no mathematical model that captures the structure of such a tree.

The goal of this work is to characterize the Internet worm infection family tree, i.e., the topology formed by worm infection. For such a tree, we are particularly interested in two metrics:

• Number of children: For a randomly selected node in the tree, how many children does it have? This metric represents the infection ability of individual hosts.

• Generation: For a randomly selected node in the tree, which generation (or level) does it belong to? This metric indicates the average path length of the graph formed by worm infection.

Fig. 1. A worm tree.

These two metrics reflect the underlying topology formed by worm infection, called the "worm tree" for short. For example, if the worm tree is a random graph, each host would infect a similar number of targets, and the average path length would increase approximately logarithmically with the total number of nodes [13], [14]. If the worm tree has a power-law topology, only a very small number of hosts infect a large number of children, whereas a majority of hosts infect few or no children; the average path length would also increase approximately logarithmically with the total number of nodes [13]. Moreover, power-law topologies are robust to random node removal but vulnerable to the removal of a small portion of the nodes with the highest degrees, whereas random graphs are robust to both removal schemes [13]. Therefore, studying the structure of the worm tree can provide insights into detecting and defending against botnets such as Conficker C.

To study these two metrics analytically, we apply probabilistic modeling methods and derive the joint probability distribution of the number of children and the generation through a sequential growth model. Specifically, we start from a worm tree with only patient zero and add new nodes into the worm tree sequentially. We then investigate the relationship between the two worm trees before and after a new node is added. From the joint distribution, we analyze the marginal distributions of the number of children and the generation. We also develop closed-form approximations to both marginal distributions and the joint distribution. Unlike other models that characterize the dynamics of worm propagation (e.g., the total number of infected hosts over time), our sequential growth model aims at capturing the main features of the topology formed by worm infection (e.g., the number of children and the generation).

As a first attempt, we analyze the worm tree formed by a wide class of worms such as random-scanning worms [8], routable-scanning worms [15], [9], importance-scanning worms [16], OPT-STATIC worms [17], and SUBOPT-STATIC worms [17]. For these worms, a new victim is compromised by each existing infected host with equal probability. We then verify the analytical results through simulations. We also employ simulations to investigate worm infection using localized scanning [18], [19]. Finally, we apply our analysis and observations to develop methods for detecting bots and study potential countermeasures for a botnet (e.g., Conficker C) that uses scan-based peer discovery to form a P2P-based botnet.

Through both analytical and empirical study, we make the following contributions. First, if a worm uses a scanning method for which a new victim is compromised by each existing infected host with equal probability, the number of children is shown both analytically and empirically to have asymptotically a geometric distribution with parameter 0.5. This means that on average half of infected hosts never compromise any target and over 98% of infected hosts have no more than five children. On the other hand, it also indicates that a small portion of hosts infect a large number of vulnerable hosts. Moreover, the generation is demonstrated to closely follow a Poisson distribution with parameter $H_n - 1$, where $n$ is the number of nodes and $H_n$ is the $n$-th harmonic number [20]. This means that the average path length of the worm tree increases approximately logarithmically with the number of nodes. Second, if a worm uses localized scanning, the number of children still has approximately a geometric distribution with parameter 0.5. Moreover, the generation still follows a Poisson distribution, but with a parameter that depends on the probability of local scanning. Therefore, most of the previous observations also apply to localized-scanning worms. Finally, a direct application of these observations is bot detection in Conficker C-like botnets. We show both analytically and empirically that while randomly examining a small portion of nodes in a botnet (i.e., random detection) can only expose a limited number of bots, examining the nodes with the largest number of children (i.e., targeted detection) is much more efficient in detecting bots. For example, our simulation shows that when 3.125% of nodes are examined, random detection exposes 9.10% of bots in total, whereas targeted detection reveals 22.36% of bots. On the other hand, we also point out that future botnets can potentially use a simple method to weaken the performance of targeted detection without greatly slowing down the speed of worm infection. To the best of our knowledge, this is the first attempt to quantitatively understand and exploit the topology formed by worm infection.

The remainder of this paper is structured as follows. Section 2 presents our sequential growth model and the assumptions used in analyzing the worm tree. Section 3 gives our analysis of the worm tree. Section 4 uses simulations to verify the analytical results and provides observations on the worm tree under localized scanning. Section 5 further develops bot detection methods and studies potential countermeasures by future botnets. Finally, Section 6 discusses the related work, and Section 7 concludes this paper.

2 Worm Tree and Sequential Growth Model

In this section, we provide the background on the worm tree and present the assumptions and the growth model.

An example of a worm tree is given in Fig. 1. Here, patient zero is the root and belongs to generation 0. The tail of an arrow is at the "father", or the infector, whereas the head of an arrow points to the "son", or the infectee. If a father belongs to generation $i$, then its children lie in generation $i + 1$. In a worm tree with $n$ nodes, we use $L_n(i,j)$ ($0 \le i, j \le n-1$) to denote the number of nodes that have $i$ children and belong to generation $j$. Note that $\sum_{i=0}^{n-1} \sum_{j=0}^{n-1} L_n(i,j) = n$. We also use $C_n(i)$ ($i = 0, 1, 2, \ldots, n-1$) to denote the number of nodes that have $i$ children and $G_n(j)$ ($j = 0, 1, 2, \ldots, n-1$) to denote the number of nodes in generation $j$. Moreover, $L_n(i,j)$, $C_n(i)$, and $G_n(j)$ are random variables. Thus, we define $p_n(i,j) = \frac{E[L_n(i,j)]}{n}$, representing the joint distribution of the number of children and the generation. Similarly, we define $c_n(i) = \frac{E[C_n(i)]}{n}$ to represent the marginal distribution of the number of children and $g_n(j) = \frac{E[G_n(j)]}{n}$ to represent the marginal distribution of the generation. Note that $c_n(i) = \sum_{j=0}^{n-1} p_n(i,j)$ and $g_n(j) = \sum_{i=0}^{n-1} p_n(i,j)$.

Although we model worm infection as a tree, different worm trees can show very different structures. Fig. 2 demonstrates two extreme cases of worm trees. Specifically, in Fig. 2(a), each infected host compromises one and only one host, except the last infected host. In this case, if the total number of nodes is $n$, $C_n(0) = 1$ and $C_n(1) = n - 1$, which lead to $c_n(0) = \frac{1}{n} \to 0$ and $c_n(1) = \frac{n-1}{n} \to 1$ when $n$ is large. That is, almost every node has one and only one child. Moreover, $G_n(j) = 1$, $j = 0, 1, 2, \ldots, n-1$, which means that $g_n(j) = \frac{1}{n}$. Thus, the average path length is $\sum_{j=0}^{n-1} j \cdot g_n(j) = \frac{n-1}{2} \sim O(n)$. That is, the average path length increases linearly with the number of nodes. In contrast, Fig. 2(b) shows another case where all hosts (except patient zero) are infected by patient zero. For the distribution of the number of children, $c_n(n-1) = \frac{1}{n}$ and $c_n(0) = \frac{n-1}{n} \to 1$ when $n$ is large, indicating that almost every node has no child. For the distribution of the generation, $g_n(0) = \frac{1}{n}$ and $g_n(1) = \frac{n-1}{n}$, so that the average path length is $\frac{n-1}{n} \approx 1$ when $n$ is large. Thus, the path length is close to a constant of 1. In this work, we attempt to identify the structure of the worm tree formed by Internet worm infection.
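The two extreme cases can be checked numerically. The following is a minimal sketch, assuming a tree is stored as a parent array in infection order; the function names `tree_statistics`, `chain_tree`, and `star_tree` are illustrative and do not come from the paper.

```python
from collections import Counter


def tree_statistics(parent):
    """Return (c, g, average path length) for one worm tree.

    parent[v] is the infector of node v, with parent[0] = -1 for patient zero.
    Nodes are assumed to be listed in infection order, so parent[v] < v.
    """
    n = len(parent)
    children = [0] * n
    generation = [0] * n
    for v in range(1, n):
        children[parent[v]] += 1
        generation[v] = generation[parent[v]] + 1
    # Fractions of nodes with i children and of nodes in generation j.
    c = {i: count / n for i, count in Counter(children).items()}
    g = {j: count / n for j, count in Counter(generation).items()}
    return c, g, sum(generation) / n


def chain_tree(n):
    """Fig. 2(a)-style tree: every host infects exactly one host."""
    return [-1] + list(range(n - 1))


def star_tree(n):
    """Fig. 2(b)-style tree: patient zero infects every other host."""
    return [-1] + [0] * (n - 1)


if __name__ == "__main__":
    for name, tree in (("chain", chain_tree(1000)), ("star", star_tree(1000))):
        c, g, avg = tree_statistics(tree)
        print(name, "c_n(0) =", c.get(0, 0.0), "average path length =", avg)
```

For the chain the average path length is $(n-1)/2$, and for the star it is $(n-1)/n$, in line with the two cases discussed above.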

To study the worm tree analytically, we make several assumptions and considerations in this paper. First, to simplify the model, we assume that infected hosts have the same scanning rate. This assumption is removed in Section 4, where we use simulations to study the effect of the variation of scanning rates on the worm tree. Second, we consider a wide class of worms for which a new victim is compromised by each existing infected host with equal probability. Such worms include random-scanning worms, routable-scanning worms, importance-scanning worms, OPT-STATIC worms, and SUBOPT-STATIC worms. Random scanning selects targets in the IPv4 address space randomly and has been the main scanning method for both worms and botnets [8], [2]; routable scanning finds victims in the routable IPv4 address space [15], [9]; and importance scanning probes subnets according to the vulnerable-host distribution [16]. OPT-STATIC and SUBOPT-STATIC are optimal and suboptimal scanning methods proposed in [17] to minimize the number of worm scans required to reach a predetermined fraction of vulnerable hosts. In Section 4.3, we extend our study to localized scanning, which preferentially searches for targets in the local subnet and has also been used by real worms [18], [19]. Third, we consider the classic susceptible-infected (SI) model, ignoring the cases in which an infected host is cleaned and becomes vulnerable again, or is patched and becomes invulnerable. The SI model assumes that once infected, a host remains infected. Such a simple model has been widely applied in studying worm infection [8], [9], [21], [17] and presents the worst-case scenario. Fourth, we assume that there is no re-infection. That is, if an infected host is hit by a worm scan, this host will not be re-infected. As a result, every infected host except patient zero has one and only one father, and the resulting graph formed by worm infection is a tree. Fifth, we assume that the worm starts from one infected host, i.e., patient zero or a hitlist size of 1. When the hitlist size is larger than 1, the underlying infection topology is a worm forest instead of a worm tree. Our analysis, however, can easily be extended to model the worm forest. Finally, to simplify the analysis, we assume that no two nodes are added to the worm tree at the same time. That is, no two vulnerable hosts are infected simultaneously. We relax this assumption in Section 4, where simulations are performed.

Fig. 2. Two extreme cases of worm trees.

Based on these considerations and assumptions, the sequential growth model of a worm tree works as follows. We consider a fixed sequence of infected hosts (i.e., nodes) $v_1, v_2, \ldots$ and inductively construct a random worm tree $(T_n)_{n \ge 1}$, where $n$ is the number of nodes and $T_1$ has only patient zero. Infecting a new host is equivalent to adding a new node into the existing worm tree. Hence, given $T_{n-1}$, $T_n$ is formed by adding node $v_n$ together with an edge directed from an existing node $v_f$ to $v_n$. According to the assumption, $v_f$ is randomly chosen among the $n - 1$ nodes in the tree, i.e., $\Pr(f = k) = \frac{1}{n-1}$, $k = 1, 2, \ldots, n-1$.

Note that such a growth model and its variations have been widely used in studying topology generators [22], [23]. In this paper, we apply this model to characterize worm infection.
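The growth rule is short enough to simulate directly. The sketch below grows one random worm tree by attaching each new node to a uniformly chosen existing node and then tabulates the empirical fractions of nodes by number of children and by generation. The function names and the zero-based node indexing (node 0 is patient zero) are choices made here for illustration only.

```python
import random
from collections import Counter


def grow_worm_tree(n, seed=None):
    """Grow a worm tree with n nodes under the sequential growth model.

    Node 0 is patient zero. Each new node v picks its infector uniformly
    at random among the v nodes already in the tree.
    Returns (parent, children, generation) as lists indexed by node.
    """
    rng = random.Random(seed)
    parent = [-1] * n
    children = [0] * n
    generation = [0] * n
    for v in range(1, n):
        f = rng.randrange(v)          # uniform over existing nodes 0..v-1
        parent[v] = f
        children[f] += 1
        generation[v] = generation[f] + 1
    return parent, children, generation


def empirical_distributions(n, seed=None):
    """Fractions of nodes with i children and in generation j for one tree."""
    _, children, generation = grow_worm_tree(n, seed)
    c_n = {i: k / n for i, k in sorted(Counter(children).items())}
    g_n = {j: k / n for j, k in sorted(Counter(generation).items())}
    return c_n, g_n


if __name__ == "__main__":
    c_n, g_n = empirical_distributions(20000, seed=1)
    print("fraction of leaves:", c_n[0])
    print("largest generation observed:", max(g_n))
```

A single run gives one realization of the random tree; averaging over many seeds approximates the expectations that define the distributions in this section.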

3 Mathematical Analysis 

In this section, we study the worm tree through mathematical analysis. Specifically, we first derive the joint distribution of the number of children and the generation, i.e., $p_n(i,j)$, by applying probabilistic methods. We then use $p_n(i,j)$ to analyze the two marginal distributions, i.e., $c_n(i)$ and $g_n(j)$, and obtain their closed-form approximations. Finally, we find a closed-form approximation to $p_n(i,j)$.

3.1 Joint Distribution 

For a worm tree with only patient zero (i.e., $n = 1$), since $L_1(0,0) = 1$ with probability 1, $p_1(0,0) = 1$. Similarly, for a worm tree with $n = 2$, it is evident that $L_2(1,0) = L_2(0,1) = 1$. Thus, $p_2(1,0) = p_2(0,1) = \frac{1}{2}$. We now consider $p_n(i,j)$ ($0 \le i, j \le n-1$) when $n \ge 3$. Specifically, we study two cases:

(1) $p_n(0,j)$, i.e., the proportion of the number of leaves in generation $j$ in $T_n$. Assume that $T_{n-1}$ is given, and there are $L_{n-1}(0,j)$ leaves in generation $j$ and in total $G_{n-1}(j-1) = \sum_{i=0}^{n-2} L_{n-1}(i,j-1)$ nodes in generation $j-1$. Note that we have extended the notation so that $G_{n-1}(-1) = L_{n-1}(i,-1) = 0$, $0 \le i \le n-2$. When a new node $v_n$ is added, $v_n$ becomes a leaf of $T_n$. If $v_n$ is connected to one of the existing nodes in generation $j-1$, $v_n$ belongs to generation $j$; the probability of such an event is $\frac{G_{n-1}(j-1)}{n-1}$. Moreover, if a leaf in generation $j$ in $T_{n-1}$ connects to $v_n$, this node is no longer a leaf and now has one child; the probability of this event is $\frac{L_{n-1}(0,j)}{n-1}$. Therefore, we can obtain the stochastic recurrence of $L_n(0,j)$:

$$L_n(0,j) = \begin{cases} L_{n-1}(0,j) + 1, & \text{w.p. } \frac{G_{n-1}(j-1)}{n-1} \\ L_{n-1}(0,j) - 1, & \text{w.p. } \frac{L_{n-1}(0,j)}{n-1} \\ L_{n-1}(0,j), & \text{otherwise.} \end{cases} \tag{1}$$

Given $T_{n-1}$ (i.e., $L_{n-1}(0,j)$ and $G_{n-1}(j-1)$), the conditional expected value of $L_n(0,j)$ is $[L_{n-1}(0,j) + 1] \cdot \frac{G_{n-1}(j-1)}{n-1} + [L_{n-1}(0,j) - 1] \cdot \frac{L_{n-1}(0,j)}{n-1} + L_{n-1}(0,j) \cdot \left[1 - \frac{G_{n-1}(j-1) + L_{n-1}(0,j)}{n-1}\right]$, i.e.,

$$E[L_n(0,j) \mid T_{n-1}] = \frac{n-2}{n-1} L_{n-1}(0,j) + \frac{1}{n-1} G_{n-1}(j-1). \tag{2}$$

Applying $E[L_n(0,j)] = E[E[L_n(0,j) \mid T_{n-1}]]$ (i.e., the law of total expectation), we obtain

$$E[L_n(0,j)] = \frac{n-2}{n-1} E[L_{n-1}(0,j)] + \frac{1}{n-1} E[G_{n-1}(j-1)]. \tag{3}$$

Using the definitions $p_n(0,j) = \frac{E[L_n(0,j)]}{n}$ and $g_{n-1}(j-1) = \frac{E[G_{n-1}(j-1)]}{n-1} = \sum_{i=0}^{n-2} p_{n-1}(i,j-1)$, the above equation leads to

$$p_n(0,j) = \frac{n-2}{n} p_{n-1}(0,j) + \frac{1}{n} g_{n-1}(j-1) \tag{4}$$

$$\phantom{p_n(0,j)} = \frac{n-2}{n} p_{n-1}(0,j) + \frac{1}{n} \sum_{i=0}^{n-2} p_{n-1}(i,j-1). \tag{5}$$

Fig. 3. Joint distribution of the number of children and the generation (n = 2000).

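(2) $p_n(i,j)$, $1 \le i \le n-1$. Given $L_{n-1}(i,j)$ and $L_{n-1}(i-1,j)$ in $T_{n-1}$, we study $L_n(i,j)$ in $T_n$. When the new node $v_n$ is added into $T_{n-1}$, $v_n$ is connected to a node with $i-1$ children in generation $j$ with probability $\frac{L_{n-1}(i-1,j)}{n-1}$, or is connected to a node with $i$ children in generation $j$ with probability $\frac{L_{n-1}(i,j)}{n-1}$. Thus, in $T_n$,

$$L_n(i,j) = \begin{cases} L_{n-1}(i,j) + 1, & \text{w.p. } \frac{L_{n-1}(i-1,j)}{n-1} \\ L_{n-1}(i,j) - 1, & \text{w.p. } \frac{L_{n-1}(i,j)}{n-1} \\ L_{n-1}(i,j), & \text{otherwise.} \end{cases} \tag{6}$$

This relationship leads to

$$E[L_n(i,j) \mid T_{n-1}] = \frac{n-2}{n-1} L_{n-1}(i,j) + \frac{1}{n-1} L_{n-1}(i-1,j). \tag{7}$$

Therefore,

$$E[L_n(i,j)] = \frac{n-2}{n-1} E[L_{n-1}(i,j)] + \frac{1}{n-1} E[L_{n-1}(i-1,j)]. \tag{8}$$

That is,

$$p_n(i,j) = \frac{n-2}{n} p_{n-1}(i,j) + \frac{1}{n} p_{n-1}(i-1,j). \tag{9}$$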

Summarizing the above two cases, we have the following theorem:

Theorem 1: When $n \ge 3$, the joint distribution of the number of children and the generation in a worm tree $T_n$ follows

$$p_n(i,j) = \begin{cases} \frac{n-2}{n} p_{n-1}(0,j) + \frac{1}{n} g_{n-1}(j-1), & i = 0 \\ \frac{n-2}{n} p_{n-1}(i,j) + \frac{1}{n} p_{n-1}(i-1,j), & \text{otherwise,} \end{cases} \tag{10}$$

where $0 \le j \le n-1$.

Theorem 1 provides a way to calculate $p_n(i,j)$ recursively from $p_2(i,j)$. Fig. 3 shows a snapshot of $p_n(i,j)$ when $n = 2000$. It can be seen that when the generation is specified (i.e., $j$ is fixed), $p_n(i,j)$ is a monotonic function that decreases quickly as $i$ increases. On the other hand, when the number of children is given (i.e., $i$ is fixed), $p_n(i,j)$ has a bell shape. Moreover, since $\sum_i \sum_j p_n(i,j) = 0.9976$ when the sums are restricted to small values of $i$ and $j$, most nodes do not have a large number of children, and the worm tree does not have a large average path length.
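The recursion in Theorem 1 can be evaluated with a small dynamic program. The sketch below is one possible implementation, not the authors' code. It keeps only entries with $i \le$ `max_i` and $j \le$ `max_j` (both are parameters introduced here), which is exact for the retained entries because $p_n(i,j)$ depends only on indices $i-1$, $i$, $j-1$, and $j$. The marginal $g_{n-1}(j-1)$ is tracked separately through the relation obtained by summing Theorem 1 over $i$, namely $g_n(j) = \frac{n-1}{n} g_{n-1}(j) + \frac{1}{n} g_{n-1}(j-1)$.

```python
def joint_distribution(n, max_i=30, max_j=30):
    """Return p with p[i][j] = p_n(i, j) for 0 <= i <= max_i, 0 <= j <= max_j.

    Starts from p_2 and applies the recursion of Theorem 1 up to size n.
    """
    if n < 2 or max_i < 1 or max_j < 1:
        raise ValueError("need n >= 2, max_i >= 1 and max_j >= 1")
    rows, cols = max_i + 1, max_j + 1
    p = [[0.0] * cols for _ in range(rows)]
    p[1][0] = 0.5                 # patient zero: one child, generation 0
    p[0][1] = 0.5                 # second node: leaf, generation 1
    g = [0.0] * cols              # g[j] = g_{m-1}(j), tracked separately
    g[0] = g[1] = 0.5
    for m in range(3, n + 1):
        a, b = (m - 2) / m, 1.0 / m
        new_p = [[0.0] * cols for _ in range(rows)]
        for j in range(cols):
            g_left = g[j - 1] if j >= 1 else 0.0
            new_p[0][j] = a * p[0][j] + b * g_left
            for i in range(1, rows):
                new_p[i][j] = a * p[i][j] + b * p[i - 1][j]
        # Marginal recursion for the generation (sum of Theorem 1 over i).
        g = [(m - 1) / m * g[j] + b * (g[j - 1] if j >= 1 else 0.0)
             for j in range(cols)]
        p = new_p
    return p


if __name__ == "__main__":
    p = joint_distribution(2000)
    print("mass for 0 <= i, j <= 30:", sum(map(sum, p)))
```

For $n = 3$ this gives $p_3(0,1) = 1/3$ and $p_3(0,2) = p_3(1,0) = p_3(1,1) = p_3(2,0) = 1/6$, which agrees with enumerating the two possible three-node trees by hand.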

3.2 Number of Children 

We use $p_n(i,j)$ to derive the marginal distribution of the number of children, i.e., $c_n(i)$. Again, we study two cases:

(1) $c_n(0)$, i.e., the proportion of the number of leaves in $T_n$. Since $c_n(0) = \sum_{j=0}^{n-1} p_n(0,j)$ and $\sum_{j=0}^{n-1} g_{n-1}(j-1) = 1$, we obtain the recursive relationship of $c_n(0)$ from Equation (4):

$$c_n(0) = \frac{n-2}{n} c_{n-1}(0) + \frac{1}{n}. \tag{11}$$

Moreover, note that $c_2(0) = \frac{1}{2}$. If we assume that $c_{n-1}(0) = \frac{1}{2}$, Equation (11) gives $c_n(0) = \frac{n-2}{2n} + \frac{1}{n} = \frac{1}{2}$, so we obtain by induction that

$$c_n(0) = \frac{1}{2}. \tag{12}$$

This indicates that no matter how many nodes are in the worm tree, on average half of the nodes are leaves, i.e., on average 50% of infected hosts never compromise any target.

(2) $c_n(i)$, $1 \le i \le n-1$. From Equation (9) and $c_n(i) = \sum_{j=0}^{n-1} p_n(i,j)$, we find the recurrence of $c_n(i)$ as follows:

$$c_n(i) = \frac{n-2}{n} c_{n-1}(i) + \frac{1}{n} c_{n-1}(i-1). \tag{13}$$

Summarizing the above two cases, we have the following theorem on the distribution of the number of children:

Theorem 2: When $n \ge 3$, the distribution of the number of children in a worm tree $T_n$ follows

$$c_n(i) = \begin{cases} \frac{1}{2}, & i = 0 \\ \frac{n-2}{n} c_{n-1}(i) + \frac{1}{n} c_{n-1}(i-1), & 1 \le i \le n-1. \end{cases} \tag{14}$$
From Theorem 2, we can derive the statistical properties of the number of children as follows.

Corollary 1: When $n \ge 1$, the expectation and the variance of the number of children are

$$E_n[C] = \sum_{i=0}^{n-1} i \cdot c_n(i) = \frac{n-1}{n}, \tag{15}$$

$$\mathrm{Var}_n[C] = \sum_{i=0}^{n-1} (i - E_n[C])^2 \cdot c_n(i) = 2 - \frac{2H_n}{n} - \frac{n-1}{n^2}, \tag{16}$$

where $H_n = \sum_{i=1}^{n} \frac{1}{i}$ is the $n$-th harmonic number [20].

The proof of Corollary 1 is given in Appendix 1. One intuitive way to derive $E_n[C]$ is that in worm tree $T_n$, there are $n - 1$ directed edges and $n$ nodes. Thus, the average number of edges (i.e., the average number of children) of a node is $\frac{n-1}{n}$. Moreover, since $H_n$ is $O(1 + \ln n)$, $\lim_{n \to \infty} E_n[C] = 1$ and $\lim_{n \to \infty} \mathrm{Var}_n[C] = 2$.

Theorem 2 also leads to a simple closed-form expression of the distribution of the number of children when $n$ is very large, as shown in the following corollary.

Corollary 2: When $n \to \infty$, the number of children has a geometric distribution with parameter $\frac{1}{2}$, i.e.,

$$c(i) = \lim_{n \to \infty} c_n(i) = \left(\frac{1}{2}\right)^{i+1}, \quad i = 0, 1, 2, \ldots \tag{17}$$

The proof of Corollary 2 is given in Appendix 2. Corollary 2 indicates that when $n$ is very large, $c_n(i)$ decreases approximately exponentially with a decay constant of $\ln 2$ as the number of children increases.

Fig. 4. Distribution of the number of children.
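As a quick check of what the limiting geometric distribution implies, its cumulative and tail probabilities can be written in closed form. The following worked calculation uses only Equation (17); the symbol $C$ for the number of children of a randomly selected node follows the notation of Corollary 1.

```latex
\begin{align*}
\Pr(C \le k) &= \sum_{i=0}^{k} \left(\tfrac{1}{2}\right)^{i+1}
              = 1 - \left(\tfrac{1}{2}\right)^{k+1}, \\
\Pr(C \le 5) &= 1 - 2^{-6} = \tfrac{63}{64} \approx 98.44\%, \\
\Pr(C \ge 10) &= 2^{-10} = \tfrac{1}{1024} \approx 0.098\%, \\
\left(\tfrac{1}{2}\right)^{i+1} &= \tfrac{1}{2}\, e^{-i \ln 2}.
\end{align*}
```

The second line corresponds to the statement that over 98% of infected hosts have no more than five children, and the last line makes the decay constant of $\ln 2$ explicit.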

We further study, when both $n$ and $i$ are finite and large, how $c_n(i)$ varies with $n$, i.e., how the tail of the distribution of the number of children changes with $n$. First, note that $c_3(0) = \frac{1}{2}$, $c_3(1) = \frac{1}{3}$, and $c_3(2) = \frac{1}{6}$. Thus, from Equation (13), we can prove by induction that $c_n(i)$ ($n \ge 3$) is a decreasing function of $i$, i.e., $c_n(i) < c_n(i-1)$, for $1 \le i \le n-1$. Next, putting this inequality into Equation (13), we have $c_n(i) > \frac{n-1}{n} c_{n-1}(i)$. Hence, when $n$ is very large, $\frac{n-1}{n} \approx 1$, so that approximately $c_n(i) > c_{n-1}(i)$, which indicates that the tail of $c_n(i)$ increases with $n$. Fig. 4 verifies this result, showing $c_n(i)$ obtained from Theorem 2 when $n$ = 1000, 2000, 5000, and 20000, as well as the geometric distribution with parameter 0.5 obtained from Corollary 2. Note that the y-axis uses a log scale. It can be seen that when $n$ increases from 1000 to 20000, the tail of $c_n(i)$ also increases to approach the tail of the geometric distribution. Moreover, it is shown that the geometric distribution approximates the distribution of the number of children well when $n$ is large.
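The comparison between $c_n(i)$ and the geometric distribution can be reproduced numerically from the recursion in Theorem 2. The sketch below is an illustrative implementation; the truncation parameter `max_i` is an assumption introduced here to keep the computation short, and it does not affect the retained entries because $c_n(i)$ depends only on $c_{n-1}(i)$ and $c_{n-1}(i-1)$.

```python
def children_distribution(n, max_i=40):
    """Return c with c[i] = c_n(i) for 0 <= i <= min(n - 1, max_i).

    Starts from c_2 = (1/2, 1/2) and applies the recursion of Theorem 2.
    """
    if n < 2 or max_i < 1:
        raise ValueError("need n >= 2 and max_i >= 1")
    size = min(n, max_i + 1)
    c = [0.0] * size
    c[0] = c[1] = 0.5
    for m in range(3, n + 1):
        a, b = (m - 2) / m, 1.0 / m
        # Update from high i to low i so c[i - 1] still holds c_{m-1}(i - 1).
        for i in range(size - 1, 0, -1):
            c[i] = a * c[i] + b * c[i - 1]
        # c[0] stays at 1/2 for every m, as in Equation (12).
    return c


def geometric(i):
    """Limiting distribution c(i) = (1/2)^(i+1) from Corollary 2."""
    return 0.5 ** (i + 1)


if __name__ == "__main__":
    for n in (1000, 2000, 5000, 20000):
        c = children_distribution(n)
        print(n, [f"{c[i] / geometric(i):.4f}" for i in (1, 5, 10, 15)])
```

The printed ratios $c_n(i) / c(i)$ show how close each entry is to its geometric limit for the values of $n$ used in Fig. 4.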

3.3 Generation 

Next, we derive the generation distribution (i.e., $g_n(j)$) in a similar manner to the case of $c_n(i)$. Using Theorem 1 and $g_n(j) = \sum_{i=0}^{n-1} p_n(i,j)$, we obtain the following theorem:

Theorem 3: When $n \ge 3$, the distribution of the generation in a worm tree $T_n$ follows

$$g_n(j) = \frac{n-1}{n} g_{n-1}(j) + \frac{1}{n} g_{n-1}(j-1), \quad 0 \le j \le n-1, \tag{18}$$

where $g_{n-1}(-1) = 0$.

Theorem 3 gives a method to calculate the distribution of the generation recursively. Moreover, from Theorem 3, we can derive the statistical properties of the generation distribution in the following corollary.

Corollary 3: When $n \ge 1$, the expectation and the variance of the generation are

$$E_n[G] = \sum_{j=0}^{n-1} j \cdot g_n(j) = H_n - 1, \tag{19}$$

$$\mathrm{Var}_n[G] = \sum_{j=0}^{n-1} (j - E_n[G])^2 \cdot g_n(j) = H_n - H_{n,2}, \tag{20}$$

where $H_n = \sum_{i=1}^{n} \frac{1}{i}$ and $H_{n,2} = \sum_{i=1}^{n} \frac{1}{i^2}$.

The proof of Corollary 3 is given in Appendix 3. From Corollary 3, we have some interesting observations. Since $H_n$ is $O(1 + \ln n)$ and $H_{\infty,2} = \zeta(2) = \frac{\pi^2}{6} \approx 1.645$ is the Riemann zeta function of 2 [24], both $E_n[G]$ and $\mathrm{Var}_n[G]$ are $O(1 + \ln n)$. This indicates that the average path length of the worm tree (i.e., $E_n[G]$) increases approximately logarithmically with $n$. Moreover, $\lim_{n \to \infty} (E_n[G] - \ln n) = \gamma - 1$ and $\lim_{n \to \infty} (\mathrm{Var}_n[G] - \ln n) = \gamma - \zeta(2)$, where $\gamma \approx 0.577$ is the Euler-Mascheroni constant [25]. Therefore, when $n$ is large, $E_n[G] \approx \mathrm{Var}_n[G]$. Furthermore, we can use Theorem 3 to obtain a closed-form approximation to $g_n(j)$ as follows.

Corollary 4: When $n$ is very large, the generation distribution $g_n(j)$ can be approximated by a Poisson distribution with parameter $\lambda_n = E_n[G] = H_n - 1$. That is,

$$g_n(j) \approx \frac{\lambda_n^j}{j!} e^{-\lambda_n}, \quad 0 \le j \le n-1. \tag{21}$$

The proof of Corollary 4 is given in Appendix 4. Fig. 5 verifies Corollary 4, showing $g_n(j)$ obtained from Theorem 3 when $n$ = 1000, 2000, 5000, and 20000, as well as the Poisson distribution with parameter $E_n[G]$. It can be seen that when $n$ is large, the Poisson distribution fits the generation distribution closely.
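The recursion in Theorem 3 and the Poisson approximation can be compared with a few lines of code. The following sketch is illustrative only; the function names and the truncation parameter `max_j` are assumptions made here, and truncation is exact for the retained entries because $g_n(j)$ depends only on $g_{n-1}(j)$ and $g_{n-1}(j-1)$.

```python
import math


def harmonic(n):
    """n-th harmonic number H_n."""
    return sum(1.0 / k for k in range(1, n + 1))


def generation_distribution(n, max_j=60):
    """Return g with g[j] = g_n(j) for 0 <= j <= min(n - 1, max_j).

    Starts from g_2 = (1/2, 1/2) and applies the recursion of Theorem 3.
    """
    if n < 2 or max_j < 1:
        raise ValueError("need n >= 2 and max_j >= 1")
    size = min(n, max_j + 1)
    g = [0.0] * size
    g[0] = g[1] = 0.5
    for m in range(3, n + 1):
        keep = (m - 1) / m
        # Update from high j to low j so g[j - 1] still holds g_{m-1}(j - 1).
        for j in range(size - 1, 0, -1):
            g[j] = keep * g[j] + g[j - 1] / m
        g[0] *= keep
    return g


def poisson_pmf(lam, j):
    """Poisson probability mass function with parameter lam."""
    return lam ** j * math.exp(-lam) / math.factorial(j)


if __name__ == "__main__":
    n = 2000
    lam = harmonic(n) - 1
    g = generation_distribution(n)
    for j in range(0, 21, 4):
        print(j, round(g[j], 5), round(poisson_pmf(lam, j), 5))
```

The mean of the computed distribution equals $H_n - 1$ up to floating-point error, which gives a direct check of Equation (19).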



3.4 Approximation to the Joint Distribution 

Finally, we derive a closed-form approximation to the joint distribution $p_n(i,j)$. From Equation (9), we can see that when $n \to \infty$, $p_n(i,j) \approx p_{n-1}(i,j)$, which yields

$$p_n(i,j) = \frac{1}{2} p_n(i-1,j). \tag{22}$$

Hence, we can obtain

$$p_n(i,j) = \left(\frac{1}{2}\right)^{i} p_n(0,j) \approx \left(\frac{1}{2}\right)^{i+1} g_n(j), \tag{23}$$

where the second step follows because summing the first expression over $i$ gives $g_n(j) \approx 2 p_n(0,j)$. Since $g_n(j)$ closely follows the Poisson distribution when $n$ is very large, as in Corollary 4,

$$p_n(i,j) \approx \left(\frac{1}{2}\right)^{i+1} \cdot \frac{\lambda_n^j}{j!} e^{-\lambda_n}, \quad 0 \le i, j \le n-1, \tag{24}$$

where $\lambda_n = H_n - 1$. The above derivation also shows that when $n$ is very large, the number of children and the generation are almost independent random variables.

Fig. 6 shows the parity plot of the approximation to the joint distribution when $n = 2000$. In the figure, the x-axis is the actual $p_n(i,j)$ obtained from Theorem 1, and the y-axis is the approximated $p_n(i,j)$ from Equation (24), where $0 \le i, j \le 30$. It can be seen that most points are on or near the diagonal line, indicating that the approximation to the joint distribution is reasonable.
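The closed-form approximation is a product of a geometric term in $i$ and a Poisson term in $j$, which is easy to evaluate. The sketch below computes it for a given $n$ and checks that its marginals are the geometric and Poisson distributions, as the independence remark suggests. The function names are illustrative and not taken from the paper.

```python
import math


def harmonic(n):
    """n-th harmonic number H_n."""
    return sum(1.0 / k for k in range(1, n + 1))


def joint_approx(i, j, lam):
    """Approximate joint probability: geometric(1/2) in i times Poisson(lam) in j."""
    return 0.5 ** (i + 1) * lam ** j * math.exp(-lam) / math.factorial(j)


if __name__ == "__main__":
    n = 2000
    lam = harmonic(n) - 1             # Poisson parameter H_n - 1
    grid = [[joint_approx(i, j, lam) for j in range(31)] for i in range(31)]
    print("lambda_n =", round(lam, 4))
    print("approximate mass for 0 <= i, j <= 30:", sum(map(sum, grid)))
    # Marginal over j recovers the geometric term (1/2)^(i+1).
    print("sum over j at i = 3:", sum(joint_approx(3, j, lam) for j in range(101)))
```

Comparing `grid` entry by entry with values computed from the recursion of Theorem 1 reproduces the kind of parity check described for Fig. 6.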

4 Simulations and Verification 

In this section, we study the worm infection structure through simulations. As far as we know, there is no publicly available data that shows a real worm tree and could be used to verify our analytical results. Moreover, real experiments in a controlled environment are impractical for this study, since the closed-form approximations are derived under the assumption that the number of nodes is very large. Therefore, we rely on simulations. Specifically, we first simulate the infection structure of the Code Red v2 worm and then study the effects of important parameters on the worm tree. Finally, we extend our simulation to localized-scanning worms.

Fig. 7. Simulating the infection structure of the Code Red v2 worm ($n_0 = 360000$).

4.1 Code Red v2 Worm Verification

We simulate the propagation of the Code Red v2 worm by using and extending the simulator in [26]. Specifically, the simulator considers a discrete-time system and mimics the random-scanning behavior of infected hosts during each discrete time interval. Moreover, the parameter setting is based on the Code Red v2 worm's characteristics. For example, the vulnerable population is $n_0 = 360{,}000$, and a newly infected host is assigned a scanning rate of 358 scans/min. Detailed information about how the parameters are chosen can be found in Section VII of [27]. We then extend the simulator to track the worm infection structure by adding the number of children and the generation to each infected host. Moreover, we set the time unit to 20 seconds and start our simulation at time tick 0 with patient zero. Note that we remove the assumption used in the sequential growth model that no two hosts are compromised at the same time. That is, multiple hosts can be compromised at one time tick. Moreover, all new victims of the current time tick start scanning at the next time tick. The simulation results (mean ± standard deviation) are obtained from 100 independent runs with different seeds and are presented in Fig. 7.

Fig. 7(a) shows the distribution of the number of children, comparing the simulation results of $c_n(i)$ for $n = n_0/4$, $n_0$, and $4n_0$ with the geometric distribution obtained from Corollary 2. Note that the y-axis uses a log scale. The dotted line represents the standard deviation that goes into negative territory. It can be seen that the distribution of the number of children is well approximated by the geometric distribution with parameter 0.5. This implies that $c_n(i)$ decreases approximately exponentially with a decay constant of $\ln 2$. Specifically, in all three cases, on average 50.0% of the infected hosts do not have children, about 98.4% of them have no more than five children, and 0.1% of them have at least ten children. We also calculate the expectation and the variance of the number of children from the simulation and find that they are identical to the analytical results obtained from Corollary 1. Fig. 7(b) demonstrates the generation distribution, comparing the simulation results of $g_n(j)$ for $n = n_0/4$, $n_0$, and $4n_0$ with the Poisson distributions with parameter $E_n[G] = H_n - 1$ obtained from Corollary 4. It can be seen that the simulation results of $g_n(j)$ closely follow the Poisson distributions in all three cases. Hence, the simulation results verify that the average path length of the worm tree increases approximately logarithmically with the total number of infected hosts. Moreover, we also compute the expectation and the variance of the generation in simulations and verify the analytical results in Corollary 3. Fig. 7(c) compares the measured joint distribution from simulations with the approximated joint distribution from Equation (24) by using a parity plot. It can be seen that most points are on or near the diagonal line, indicating that the approximation works well.
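The discrete-time procedure described in this subsection can be sketched at a much smaller scale. The code below is not the simulator used in the paper. It is a minimal illustration with hypothetical parameters (2000 vulnerable hosts in a $2^{16}$ address space, 2 scans per tick) chosen so that it runs in about a second, and the function name `simulate_random_scanning` is introduced here. It keeps the three rules stated above: infected hosts scan uniformly at random, several hosts may be infected in the same tick, and new victims start scanning at the next tick.

```python
import random


def simulate_random_scanning(n_vulnerable=2000, address_space=2 ** 16,
                             scans_per_tick=2, seed=0, max_ticks=100_000):
    """Small-scale discrete-time SI simulation that records the worm tree.

    All parameter values are illustrative, not Code Red v2 settings.
    Returns (parent, children, generation, number_infected).
    """
    rng = random.Random(seed)
    addresses = rng.sample(range(address_space), n_vulnerable)
    index_of = {addr: k for k, addr in enumerate(addresses)}
    parent = [None] * n_vulnerable      # None means not yet infected
    children = [0] * n_vulnerable
    generation = [0] * n_vulnerable
    parent[0] = -1                      # host 0 is patient zero
    infected = [0]
    for _tick in range(max_ticks):
        if len(infected) == n_vulnerable:
            break
        newly_infected = []
        for src in infected:
            for _scan in range(scans_per_tick):
                target = index_of.get(rng.randrange(address_space))
                # A scan succeeds only on a vulnerable, uninfected host.
                if target is not None and parent[target] is None:
                    parent[target] = src
                    children[src] += 1
                    generation[target] = generation[src] + 1
                    newly_infected.append(target)
        infected.extend(newly_infected)  # they start scanning next tick
    return parent, children, generation, len(infected)


if __name__ == "__main__":
    parent, children, generation, n = simulate_random_scanning()
    print("infected hosts:", n)
    print("fraction without children:", children.count(0) / n)
    print("average generation:", sum(generation) / n)
```

Even at this reduced scale, the fraction of hosts without children comes out close to one half; averaging over many seeds would be needed for a comparison like the one reported in Fig. 7.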

4.2 Effects of Worm Parameters 

Next, we extend our simulator to examine the effects of three important parameters of worm propagation on the worm tree: the scanning rate, the scanning rate standard deviation, and the hitlist size. When one parameter is varied, we set the other parameters to those of the Code Red v2 worm as used in Section 4.1. The simulation results are obtained from 100 independent simulation runs and are shown in Fig. 8.

Fig.slHa) and (b) show the effect of varying the scanning 
rate s (scans/min) from 158 to 558 on the distributions 
of the number of children and the generation. Here, the 
scanning rate is set to a fixed value for every infected 
host, i.e., the scanning rate standard deviation is 0. The 
figures also plot the geometric distribution with parameter 
0.5 and the Poisson distribution with parameter i^^o ~ 1 
for reference. It can be seen that the scanning rate does not 
affect the worm tree structure. 

Figs. 8(c) and (d) demonstrate the effect of the variation
of the scanning rates among different hosts (i.e., $\sigma$). In
our simulation, a newly infected host is assigned
a scanning rate (scans/min) drawn from a normal distribution
$N(358, \sigma^2)$. The figures show the simulation results when
$\sigma = 0$, 100, and 200. It can be seen that while the
scanning rate standard deviation $\sigma$ has no effect on the
generation distribution, it does affect the distribution of

Fig. 8. Effects of the scanning rate, the scanning rate standard deviation, and the hitlist size on the distributions of the
number of children and the generation ($n_0 = 360000$).



the number of children. Specifically, when $\sigma$ increases, the
tail of $c_n(i)$ moves upward from the geometric distribution
with parameter 0.5. This is because when $\sigma$ becomes larger,
the variation of the scanning rate among infected hosts is
greater. That is, there are more hosts with high scanning
rates and also more hosts with low scanning rates. As a
result, the hosts with high scanning rates tend to infect
a large number of hosts, making the tail of $c_n(i)$ move
upward. However, it is also observed that when $\sigma$ is not very
large (the case for real worms), the geometric distribution
with parameter 0.5 is still a good approximation.
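
The effect of $\sigma$ on the tail can be sketched with a small Python model in which each new host is infected by an existing host chosen with probability proportional to its scanning rate. This weighted-attachment model, the clamping of non-positive rates to 1 scan/min, and all names below are assumptions made for illustration; the paper does not state how its simulator handles such draws.

```python
import bisect
import random


def grow_tree_with_rates(n, sigma, mean_rate=358.0, seed=0):
    """Each new host is infected by an existing host picked with probability
    proportional to its scanning rate (assumed model). Rates are drawn from
    N(mean_rate, sigma^2) and clamped to at least 1 scan/min (assumption)."""
    rng = random.Random(seed)
    children = [0] * n
    cumulative = []   # running totals of the rates of infected hosts
    total = 0.0
    for k in range(n):
        if k > 0:
            u = rng.random() * total
            # First host whose cumulative rate exceeds u; guard the index
            # against floating-point rounding at the upper end.
            parent = min(bisect.bisect_right(cumulative, u), k - 1)
            children[parent] += 1
        rate = max(rng.gauss(mean_rate, sigma), 1.0)
        total += rate
        cumulative.append(total)
    return children


def tail_fraction(children, threshold=10):
    """Fraction of hosts with at least `threshold` children."""
    return sum(1 for c in children if c >= threshold) / len(children)


if __name__ == "__main__":
    for sigma in (0, 100, 200):
        kids = grow_tree_with_rates(20000, sigma)
        print("sigma =", sigma, "fraction with >= 10 children:",
              tail_fraction(kids))
```

In this sketch, a larger `sigma` should produce a heavier tail, which matches the qualitative trend described for Fig. 8(c).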

In Figs. 8(e) and (f), we show the effect of the hitlist
size on the worm tree. As pointed out earlier, when
the hitlist size is greater than 1, the underlying infection
topology is a worm forest with the number of trees equal
to the hitlist size. Moreover, in a worm forest, it is intuitive
that each tree is a smaller version of the single worm
tree of hitlist size 1 and has fewer nodes. Hence, it is not
surprising to see that in Fig. 8(f), the generation distribution
moves leftward when the hitlist size increases. However,
the generation distribution can still be well approximated
by the Poisson distribution with parameter $H_{n_h} - 1$, where
$n_h$ is the average number of nodes in a tree. Moreover,
since in each tree the distribution of the number of children
can be approximated by the geometric distribution with
parameter 0.5, in the worm forest $c_n(i)$ still closely follows
the same distribution.

4.3 Localized Scanning 

Finally, we extend our simulation study to the infection
tree of localized-scanning worms. Different from random
scanning, localized scanning preferentially searches for
targets in the "local" address space [8]. As a result, when a
new node is added to the worm tree, it connects with a higher
probability to one of the existing nodes in the same "local"
address space. That is, the growth model
is no longer uniform attachment as studied in Section 3. For
simplicity, in this paper we only consider the $/l$ localized
scanning [19]:

• Local scanning: $p_a$ ($0 \le p_a \le 1$) of the time, a "local"
address with the same first $l$ ($0 \le l \le 32$) bits as the
attacking host is chosen as the target.

• Global scanning: $1 - p_a$ of the time, a random address
is chosen.
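
The two scanning modes above can be sketched as a single target-selection routine. The following Python function is a minimal illustration that assumes IPv4 addresses stored as 32-bit integers; the function name and signature are hypothetical and are not taken from the paper's simulator.

```python
import random


def pick_target(attacker_ip, p_a, l, rng=random):
    """Return one 32-bit scan target for /l localized scanning.

    With probability p_a the target shares its first l bits with the
    attacker (local scanning); otherwise it is a uniformly random
    address (global scanning)."""
    random_addr = rng.getrandbits(32)
    if rng.random() < p_a:
        host_bits = 32 - l
        prefix = (attacker_ip >> host_bits) << host_bits   # keep first l bits
        suffix = random_addr & ((1 << host_bits) - 1)      # randomize the rest
        return prefix | suffix
    return random_addr


if __name__ == "__main__":
    rng = random.Random(42)
    attacker = 0x0A010203  # 10.1.2.3, an arbitrary example address
    local_hits = sum(
        (pick_target(attacker, 0.6, 8, rng) >> 24) == (attacker >> 24)
        for _ in range(10000)
    )
    print("fraction of scans landing in the attacker's /8:", local_hits / 10000)
```

Setting `p_a = 0` reduces the routine to random scanning, which is the special case noted in the text.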

Note that random scanning can be regarded as a special
case of localized scanning when $p_a = 0$. Moreover, if local
scanning is selected, it can be regarded as random scanning
in a local $/l$ subnet. It has been shown that since the
vulnerable-hosts distribution is highly uneven, localized

(a) Number of children. (b) Generation. (c) Joint distribution ($p_a = 0.6$).

Fig. 9. Simulating the infection structure of the localized-scanning worm ($n = 360000$, $s = 358$ scans/min, $\sigma = 0$, hitlist = 1, and $l = 8$).



scanning can spread a worm much faster than random 
scanning IMJ, [21]. 

We extend our simulator to imitate the spread of
localized-scanning worms. We extract the distribution of
vulnerable hosts in $/l$ subnets from the dataset provided
by DShield [28], [29]. Specifically, we use the dataset in
[29] with port 80 (HTTP), which is exploited by the Code
Red worm, to generate the vulnerable-host distribution.
Moreover, we use parameters similar to those in Section 4.1 (e.g.,
$n = 360000$, $s = 358$ scans/min, $\sigma = 0$, and hitlist = 1)
and set the subnet level to 8 (i.e., $l = 8$). The results are
obtained from 100 independent simulation runs and are
shown in Fig. 9. For each run, patient zero is randomly
chosen from the vulnerable hosts.

Fig. 9(a) compares the simulation results of the distributions
of the number of children (i.e., $c_n(i)$) when $p_a = 0$, 0.3,
and 0.6 with the geometric distribution with parameter 0.5.
It is surprising that $c_n(i)$ of localized-scanning worms can
still be well approximated by the geometric distribution.
That is, the majority of nodes have few children, whereas a
small portion of compromised hosts infect a large number
of hosts. An intuitive explanation is as follows. From
Fig. 7(a), it can be seen that the total number of nodes has a
minor effect on $c_n(i)$. Hence, if in a /8 subnet the majority
of vulnerable hosts are infected through local scanning, it
is expected that $c_n(i)$ of these hosts still closely follows
the geometric distribution, since local scanning can be
regarded as random scanning inside a /8 subnet. Therefore,
both local infection and global infection lead $c_n(i)$ towards
the geometric distribution with parameter 0.5. On the other
hand, it can also be seen that when $p_a$ increases, the tail
of $c_n(i)$ moves slightly downward. This is because as $p_a$
increases, more vulnerable hosts are infected through local
scanning. Hence, it is more difficult for an infected host to
find targets after the vulnerable hosts in this host's local subnet
have been exhausted. As a result, when $p_a$ increases, fewer
nodes can have a large number of children.

Fig. 9(b) demonstrates that the generation distribution
of localized-scanning worms (i.e., $g_n(j)$) can be well
approximated by the Poisson distribution for the cases of

Fig. 10. Effect of the subnet level ($p_a = 0.6$).



$p_a = 0$, 0.3, and 0.6. The Poisson parameter, however,
depends not only on $n$, but also on $p_a$. We further define
$\lambda_n^{p_a} = E_n^{p_a}[G]$ as the expectation of the generation for a
localized-scanning worm with parameter $p_a$. Here, $E_n^{p_a}[G]$
can be easily estimated from the simulation results of $g_n(j)$.
Fig. 9(c) further shows the parity plot of the simulated joint
distribution and the approximated joint distribution from
Equation (24) when $p_a = 0.6$. Since most points are on or
near the diagonal line, the approximation is reasonable.

Moreover, Fig. 10 shows the effect of the subnet level (i.e.,
$l$) on the distribution of the number of children (i.e., $c_n(i)$).
It can be seen that when $l$ increases, the tail of $c_n(i)$ moves
downward. The reason is similar to the argument used for
Fig. 9(a), i.e., as $l$ increases, fewer nodes can infect a large
number of children. However, the figure also demonstrates
that the geometric distribution with parameter 0.5 is still a
good approximation to $c_n(i)$, especially when the number
of children is not large.

5 Applications of Observations 

Our observations on the topologies formed by worm infection
have important implications and applications for
both defenders and attackers. For example, we have found
that the generation distribution closely follows the Poisson
distribution and the average path length increases approximately
logarithmically with the number of nodes. On one
hand, some schemes have been proposed to trace worms
back to their origins through the cooperation between
infected hosts [30], [31], and our work quantifies the average
path length, which describes a lower bound on the number
of hosts required to cooperate. On the other hand, this
average path length reflects the delay or the effort for a
botmaster to deliver a command to all bots in a P2P-based
botnet like Conficker C, and our results show that the
botnet is scalable and can efficiently forward commands
to a large number of bots. In this section, we focus on the
applications of the distribution of the number of children
for both defenders and attackers. Specifically, we study a
simple and efficient bot detection method in a Conficker
C-like P2P-based botnet and consider a countermeasure by
future botnets.

5.1 Bot Detection 

We consider a P2P-based botnet formed by worm
scanning/infection. That is, once a host infects another host,
they become peers in the resulting P2P-based botnet. When
a defender captures an infected host in a botnet, the
defender can process the historical records inside the host or
monitor the traffic going into or out of the host, and can
potentially detect other infected hosts such as the father and
the children of this infected host. Our question is then:
if a defender can only access a small portion of the nodes in
a botnet, how many bots will the defender detect?
Moreover, inspired by the random removal and targeted 
removal methods used in analyzing the robustness of a 
topology fTS\, here we study two bot detection strategies: 

• Random detection: Access bots randomly. 

• Targeted detection: Access bots that have the largest 
number of children. 

Analytically, we suppose that a defender can access a
small ratio of bots in a botnet. We assume that an accessed
bot exposes itself, its father, and its children to the defender.
To simplify the analysis, we also assume that the accessed
bot ratio, $A$, is a power of 0.5 and that all exposed nodes are
distinct. We then calculate the average percentages of
bots exposed by random detection and by targeted detection.

Since, from Corollary 1, a randomly selected node has
approximately one child, the average percentage of bots
that can be exposed by random detection is

$$D_R = 3A. \tag{25}$$

For targeted detection, since the nodes with the largest
number of children are chosen and the number of children
asymptotically follows a geometric distribution with
parameter 0.5, as shown in Corollary 2,

$$A = \sum_{i=d}^{\infty} \left(\frac{1}{2}\right)^{i+1}, \tag{26}$$

where $d$ is the smallest number of children of accessed
nodes. That is, $d = -\log_2 A$. Therefore, the average
percentage of nodes exposed by targeted detection is

$$D_T = \sum_{i=d}^{\infty} (i+2)\, c_n(i) = (d+3)\left(\frac{1}{2}\right)^{d} = A(3 - \log_2 A). \tag{27}$$

Compared with random detection, targeted detection can
expose $(-A \log_2 A) \times n$ more nodes. For example, if $A = 1/64$,
on average random detection can detect 4.69% of nodes,
whereas targeted detection can expose 14.06% of bots.
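
The arithmetic in Equations (25) to (27) can be checked with a few lines of Python. The function names below are chosen here for illustration; the sketch also evaluates the sum in Equation (27) numerically, truncated at an assumed 200 terms, to confirm the closed form.

```python
import math


def random_detection(A):
    """Equation (25): each accessed bot exposes itself, its father,
    and on average one child."""
    return 3 * A


def targeted_detection(A):
    """Equation (27), closed form, for A a power of 0.5."""
    return A * (3 - math.log2(A))


def targeted_detection_sum(d, terms=200):
    """Equation (27) as a truncated sum over nodes with at least d
    children, using c(i) = (1/2)^(i+1)."""
    return sum((i + 2) * 0.5 ** (i + 1) for i in range(d, d + terms))


if __name__ == "__main__":
    A = 1 / 64
    d = int(-math.log2(A))
    print("random detection:  ", 100 * random_detection(A), "%")
    print("targeted detection:", 100 * targeted_detection(A), "%")
    print("truncated sum:     ", 100 * targeted_detection_sum(d), "%")
```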

We then extend our simulation in Section 4.1 to study the
effectiveness of the random and targeted detection strategies.
Fig. 11 shows the simulation results over 100 independent
runs for both strategies, as well as the analytical results
from Equations (25) and (27), for several values of $A$. It
can be seen that the analytical results slightly overestimate
the exposed host percentage. This is because our analysis
ignores the case in which the same node is exposed more than once.
Fig. 11 also demonstrates that targeted detection performs
much better than random detection. For example, in our
simulation, when $A = 3.125\%$, 9.10% of the bots are
exposed under random detection, whereas 22.36% of the
bots are detected under targeted detection. Therefore, when
a small portion of bots are examined, the botnets formed
by worm infection are robust to random detection, but are
relatively vulnerable to targeted detection.
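
A comparison of the two strategies can be sketched on a synthetic tree. The Python code below grows the tree with uniform attachment instead of using the paper's worm simulator, and its function names, tree size, and seed are assumptions made for illustration, so its numbers are not expected to reproduce Fig. 11 exactly.

```python
import random


def grow_tree(n, rng):
    """Uniform-attachment tree: returns parent pointers and child lists."""
    parent = [-1] * n
    kids = [[] for _ in range(n)]
    for k in range(1, n):
        p = rng.randrange(k)
        parent[k] = p
        kids[p].append(k)
    return parent, kids


def exposed_fraction(parent, kids, accessed):
    """An accessed bot exposes itself, its father, and its children;
    duplicates are counted once."""
    exposed = set()
    for v in accessed:
        exposed.add(v)
        if parent[v] >= 0:
            exposed.add(parent[v])
        exposed.update(kids[v])
    return len(exposed) / len(parent)


def compare_strategies(n=20000, A=1 / 32, seed=0):
    """Return (random detection, targeted detection) exposed fractions."""
    rng = random.Random(seed)
    parent, kids = grow_tree(n, rng)
    budget = int(A * n)
    random_pick = rng.sample(range(n), budget)
    targeted_pick = sorted(range(n), key=lambda v: len(kids[v]),
                           reverse=True)[:budget]
    return (exposed_fraction(parent, kids, random_pick),
            exposed_fraction(parent, kids, targeted_pick))


if __name__ == "__main__":
    r, t = compare_strategies()
    print("random detection exposes  ", 100 * r, "% of bots")
    print("targeted detection exposes", 100 * t, "% of bots")
```

Because `exposed_fraction` removes duplicates, the simulated values should fall somewhat below the analytical values of Equations (25) and (27), in line with the overestimate discussed above.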



5.2 A Countermeasure by Future Botnets 

To counteract the targeted detection method, an intuitive
approach for botnets is to limit the maximum number of children
of each node. That is, a small number $m$ is set. Once an
infected host has compromised $m$ other hosts, this host
stops scanning. In this way, there is no node with a large
number of children. Moreover, the infected hosts can
self-stop scanning, potentially reducing the worm traffic [32].

To analyze the robustness of such botnets against
targeted detection, we extend Corollary 2 to obtain an
approximate distribution of the number of children in a botnet
with the countermeasure:

$$c_n(i) \approx \begin{cases} \left(\frac{1}{2}\right)^{i+1}, & i = 0, 1, 2, \cdots, m-1 \\ \left(\frac{1}{2}\right)^{m}, & i = m. \end{cases} \tag{28}$$

Fig. 12. A worm countermeasure via limiting the maximum number of children.



The distribution is based on the observation that those
nodes having more than $m$ children in a botnet without
the countermeasure can now have only $m$ children. Hence,
the expected percentage of exposed nodes under targeted
detection can be calculated as

$$D'_T = \begin{cases} (m+2)\, A, & A \le \left(\frac{1}{2}\right)^{m} \\ A(3 - \log_2 A) - \left(\frac{1}{2}\right)^{m}, & A > \left(\frac{1}{2}\right)^{m}. \end{cases} \tag{29}$$
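
Equation (29) can be evaluated directly. The Python sketch below is an illustration with a function name chosen here; the sample values of $m$ and $A$ in the usage section are picked only to exercise both branches and are not taken from the paper's figures.

```python
import math


def targeted_detection_capped(A, m):
    """Equation (29): expected exposed fraction under targeted detection
    when every node has at most m children (A a power of 0.5)."""
    threshold = 0.5 ** m
    if A <= threshold:
        # Every accessed node has exactly m children.
        return (m + 2) * A
    # Uncapped value from Equation (27) minus the mass removed by the cap.
    return A * (3 - math.log2(A)) - threshold


if __name__ == "__main__":
    for m in (2, 3, 4, 5):
        for A in (1 / 4, 1 / 32):
            print("m =", m, "A =", A,
                  "exposed fraction =", targeted_detection_capped(A, m))
```

The two branches agree at $A = (1/2)^m$, where both give $(m+2)(1/2)^m$, which is a quick consistency check on the piecewise form.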



Compared with $D_T$ in Equation (27), $D'_T$ is smaller. This
means that under the countermeasure the number of exposed
nodes can be reduced significantly. For example,
when m = 3 and A = Dt = and D'j^ = 

We then extend our simulation in Section 5.1 to simulate
the worm tree generated using the above countermeasure
and evaluate its performance against targeted detection.
Fig. 12(a) shows the distribution of the number of children
when $m = 2$, 3, 4, and 5. It can be seen that except for
$m = 2$, $c_n(i)$ is well approximated by Equation (28). For
$m = 2$, since an infected host stops scanning when it has
hit two vulnerable hosts, leaves in the worm tree have
more chances to recruit a child. Fig. 12(b) demonstrates
the expected percentage of exposed nodes (i.e., $D'_T$)
for several values of $A$ and $m = 2$, 3, 4, and 5. It can
be seen that $D'_T$ approximately follows the analytical
results in Equation (29). Moreover, the expected percentage
of exposed nodes under the countermeasure is reduced
significantly. For example, when $A = 3.125\%$, the percentage
is reduced from 22.36% without the countermeasure to
19.80%, 15.99%, 12.58%, and 9.38% when $m = 5$, 4, 3, and
2, respectively.

On the other hand, since not every infected host keeps
scanning for targets, the countermeasure can potentially
slow down the speed of worm infection. Thus, we also
simulate the propagation speed of worms that limit the
maximum number of children and plot the results in
Fig. 12(c) for $m = 2$, 3, 4, and 5, as well as for the original worm
without the countermeasure. It can be seen that except for
$m = 2$, the worm does not slow down much. Even
when $m = 2$, the worm can infect most vulnerable hosts
within 17 hours. Moreover, Figs. 12(b) and (c) demonstrate
the tradeoff between the efficiency of worm infection and
the robustness of the formed botnet topology. That is, a
worm with the countermeasure spreads more slowly, but the
resulting botnet is more robust against targeted detection.
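
The structure produced by the countermeasure can be sketched with a capped variant of the sequential growth model: a new node attaches uniformly at random to one of the nodes that still has fewer than $m$ children. This Python model, including its names and parameters, is an assumption made for illustration and omits the timing of real scanning, so it says nothing about propagation speed.

```python
import random
from collections import Counter


def grow_capped_tree(n, m, seed=0):
    """Uniform attachment among nodes that are still scanning, i.e. nodes
    with fewer than m children (assumed model of the countermeasure)."""
    rng = random.Random(seed)
    children = [0] * n
    active = [0]                      # nodes that have not yet hit the cap
    for k in range(1, n):
        idx = rng.randrange(len(active))
        parent = active[idx]
        children[parent] += 1
        if children[parent] == m:
            # Swap-remove the saturated node so it stops "scanning".
            active[idx] = active[-1]
            active.pop()
        active.append(k)
    return children


if __name__ == "__main__":
    n = 20000
    for m in (2, 3, 4, 5):
        counts = Counter(grow_capped_tree(n, m))
        print("m =", m, [round(counts[i] / n, 4) for i in range(m + 1)])
```

The printed fractions can be compared against Equation (28); the text above notes that the approximation is weakest for $m = 2$.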

6 Related work 

Since the Code Red worm in 2001, Internet worms have 
been an active research topic. Many mathematical models 
have been developed to characterize the spread of worms, 
estimate worm behaviors, and contain worm propagation. 
Most models, however, have focused on the macro-level 
behavior of worm infection. Specifically, different analytical 
approaches have been applied to study the total number 
of infected hosts over time ID, lU, |[10|, im, d, Q, 
[27]. For example, Staniford et al. used a simple differential
equation to estimate the global propagation speed of the
Code Red v2 worm [8], whereas Rohloff et al. applied a
stochastic model to reflect the variation of the number of
infected hosts at the early stage of worm infection. Models
of the micro-level behavior of worm infection, however, have
received little attention. In this paper, we apply probabilistic
modeling methods and reveal some key micro-level information,
such as the infection ability of individual hosts and
the underlying botnet topology formed by worm infection.

Some efforts have focused on studying the "who
infects whom" information or the worm infection sequence
t3Qll , t33ll , (3111 , |3il. Different from our work, the prior work 
investigates the details of the random number generator
of worm propagation [30] or infers the worm infection
sequence through the observations of network telescopes
[33], [34]. Moreover, Sellke et al. applied a branching process
to study the effectiveness of a containment strategy [35].
They assume that the total number of scans of an infected
host is bounded. As a result, the worm tree studied in their
work is fundamentally different from the one in our work.

Botnets have become the top threat to the Internet in
recent years. It has been shown that in current botnets,
worm infection is still a main tool for recruiting new bots
or collecting network information, and random scanning
has been widely used [3]. Moreover, botnets are rapidly
transitioning from IRC systems to P2P systems. In [36], Wang et
al. gave a systematic study of P2P-based botnets, whereas
in [14], Dagon et al. surveyed different P2P-based botnet
topologies, such as random graphs and power-law topologies.
Several methods have been proposed to construct P2P-
based botnets through worm infection and re-infection H), 

m 

Modeling the topology generation process has been an
active research area. For example, Barabasi et al. developed
the well-known Barabasi-Albert (BA) model and used a
mean-field approach to characterize the growth of a topology
with both preferential attachment and uniform attachment
[22], [23]. Moreover, two exact mathematical models
have been studied for the BA model [37], [38]. From the
theoretical aspect, our proposed worm tree is similar to the
random tree. For example, Devroye used the records theory
to derive the distribution of the level of a random ordered
tree in [39]. Compared with these theoretical efforts, our
work studies a very different problem (i.e., botnets formed
by worm infection) and uses a very different approach (i.e.,
probabilistic modeling).

7 Conclusions 

In this paper, we attempt to capture the key characteris- 
tics of the Internet worm infection family tree and apply 
them to bot detection. We have shown analytically and 
empirically that for the infection tree formed by a wide 
class of worms, the number of children asymptotically has
a geometric distribution with parameter 0.5, and the generation
closely follows a Poisson distribution with parameter
$E_n[G]$ (i.e., $H_n - 1$). As a result, on average half of infected
hosts never compromise any target, over 98% of nodes 
have no more than five children, and a small portion of 
hosts have a large number of children. Moreover, the aver- 
age path length of the worm tree increases approximately 
logarithmically with the number of nodes. We have also 
demonstrated empirically that similar observations can be 
found in localized-scanning worms. We have then applied 
the observations to bot detection and found that targeted 
detection is an efficient way to expose bots in a botnet. 
However, we have also pointed out that a simple counter- 
measure by future botnets can weaken the performance of 
targeted detection, without greatly slowing down the speed 
of worm infection. 

As part of our ongoing work, we plan to study in more 
depth efficient methods against future botnets and relax our 
assumptions to include more worm dynamics. For example, 
we are studying the effect of user defenses on the worm tree 

iQi. 

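APPENDIX 1: Proof of Corollary 1

We apply the z-transform to derive the expectation and the
variance of the number of children. First, note that
Corollary 1 holds for $n = 1$ and 2. Next, when $n \ge 3$, we define
the z-transform

$$X_n(z) = \sum_{i=0}^{\infty} c_n(i) z^{-i}. \tag{30}$$

Setting $c_{n-1}(-1) = 1$, we can transform Theorem 2 to

$$c_n(i) = \frac{n-2}{n} c_{n-1}(i) + \frac{1}{n} c_{n-1}(i-1), \quad 0 \le i \le n-1, \tag{31}$$

when $n \ge 3$. Then, putting Equation (31) into Equation (30),
we can obtain the difference equation of the z-transform

$$X_n(z) = \left(\frac{1}{n} z^{-1} + \frac{n-2}{n}\right) X_{n-1}(z) + \frac{1}{n}. \tag{32}$$

Note that $E_n[C] = -\left.\frac{dX_n(z)}{dz}\right|_{z=1}$ and $X_{n-1}(1) = 1$, which
leads to

$$E_n[C] = \frac{n-1}{n} E_{n-1}[C] + \frac{1}{n}. \tag{33}$$

Since $E_2[C] = \frac{1}{2}$, we can show by induction that

$$E_n[C] = \frac{n-1}{n}. \tag{34}$$

Moreover, $E_n[C^2] = \left.\frac{d}{dz}\left[z \frac{dX_n(z)}{dz}\right]\right|_{z=1}$ yields

$$E_n[C^2] = \frac{n-1}{n} E_{n-1}[C^2] + \frac{2}{n} E_{n-1}[C] + \frac{1}{n} \tag{35}$$

$$= \frac{n-1}{n} E_{n-1}[C^2] + \frac{3n-5}{n(n-1)}. \tag{36}$$

Thus, we can use $E_2[C^2] = \frac{1}{2}$ to prove by induction that

$$E_n[C^2] = 2 + \frac{(n-1)(n-2)}{n^2} - \frac{2H_n}{n}, \tag{37}$$

where $H_n = \sum_{i=1}^{n} \frac{1}{i}$ is the $n$-th harmonic number [20].

Therefore,

$$\mathrm{Var}_n[C] = E_n[C^2] - E_n^2[C] \tag{38}$$

$$= 2 - \frac{n-1}{n^2} - \frac{2H_n}{n}. \tag{39}$$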
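
The closed forms in this appendix can be checked numerically by iterating the recursion for $c_n(i)$ with exact rational arithmetic. The Python sketch below is such a check; the recursion it iterates is the form $c_n(i) = \frac{n-2}{n} c_{n-1}(i) + \frac{1}{n} c_{n-1}(i-1)$ with $c_{n-1}(-1) = 1$ used in the proof, and the function names are chosen here for illustration.

```python
from fractions import Fraction


def children_distribution(n):
    """Exact c_n(i), i = 0..n-1, iterated from c_2 = (1/2, 1/2) using
    c_k(i) = (k-2)/k * c_{k-1}(i) + 1/k * c_{k-1}(i-1), c_{k-1}(-1) = 1."""
    c = [Fraction(1, 2), Fraction(1, 2)]
    for k in range(3, n + 1):
        prev = c + [Fraction(0)]
        c = [Fraction(k - 2, k) * prev[i]
             + Fraction(1, k) * (prev[i - 1] if i > 0 else 1)
             for i in range(k)]
    return c


def moments(dist):
    """Return (mean, second moment, variance) of a distribution on 0..len-1."""
    mean = sum(i * p for i, p in enumerate(dist))
    second = sum(i * i * p for i, p in enumerate(dist))
    return mean, second, second - mean ** 2


if __name__ == "__main__":
    n = 50
    H_n = sum(Fraction(1, i) for i in range(1, n + 1))
    mean, second, var = moments(children_distribution(n))
    print(mean == Fraction(n - 1, n))
    print(second == 2 + Fraction((n - 1) * (n - 2), n * n) - 2 * H_n / n)
    print(var == 2 - Fraction(n - 1, n * n) - 2 * H_n / n)
```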



APPENDIX 2: Proof of Corollary 2 

It is already known that $c(0) = \frac{1}{2}$. When $i \ge 1$, this
corollary follows readily from Equation (13). As $n \to \infty$,
$c_{n-1}(i) = c_n(i) = c(i)$, which yields

$$c(i) = \frac{n-2}{n} c(i) + \frac{1}{n} c(i-1). \tag{40}$$

That is,

$$c(i) = \frac{1}{2} c(i-1), \quad i \ge 1. \tag{41}$$

Hence, from $c(0) = \frac{1}{2}$, we can recursively obtain

$$c(i) = \left(\frac{1}{2}\right)^{i+1}, \quad i \ge 0. \tag{42}$$
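
The convergence to the geometric distribution can be illustrated by iterating the finite-$n$ recursion for $c_n(i)$ and comparing it with $(1/2)^{i+1}$. The following Python sketch does this in floating point; the function name, the truncation at ten children, and the choice $n = 2000$ are assumptions made for illustration.

```python
def children_distribution(n, max_children=10):
    """c_n(i) for i = 0..max_children, iterated from c_2 = (1/2, 1/2) using
    c_k(i) = (k-2)/k * c_{k-1}(i) + 1/k * c_{k-1}(i-1), c_{k-1}(-1) = 1.
    Truncation is safe because c_k(i) depends only on indices <= i."""
    c = [0.0] * (max_children + 1)
    c[0] = c[1] = 0.5
    for k in range(3, n + 1):
        # Update from high to low index so c[i - 1] is still the old value.
        for i in range(max_children, 0, -1):
            c[i] = (k - 2) / k * c[i] + c[i - 1] / k
        c[0] = (k - 2) / k * c[0] + 1.0 / k
    return c


if __name__ == "__main__":
    c = children_distribution(2000)
    for i in range(6):
        print(i, c[i], 0.5 ** (i + 1))
```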

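APPENDIX 3: Proof of Corollary 3

Similar to the proof of Corollary 1, we apply the z-transform to
derive the expectation and the variance of the generation.
First, note that Corollary 3 holds for $n = 1$ and 2. Next,
when $n \ge 3$, we define the z-transform

$$Y_n(z) = \sum_{j=0}^{\infty} g_n(j) z^{-j}. \tag{43}$$

Putting the recursion for $g_n(j)$ in Theorem 3 into Equation (43), we can obtain the
difference equation of the z-transform

$$Y_n(z) = \left(\frac{1}{n} z^{-1} + \frac{n-1}{n}\right) Y_{n-1}(z). \tag{44}$$

Note that $E_n[G] = -\left.\frac{dY_n(z)}{dz}\right|_{z=1}$ and $Y_{n-1}(1) = 1$, which
leads to

$$E_n[G] = E_{n-1}[G] + \frac{1}{n}. \tag{45}$$

Since $E_2[G] = \frac{1}{2}$, we can show by induction that

$$E_n[G] = H_n - 1. \tag{46}$$

Moreover, $E_n[G^2] = \left.\frac{d}{dz}\left[z \frac{dY_n(z)}{dz}\right]\right|_{z=1}$ yields

$$E_n[G^2] = E_{n-1}[G^2] + \frac{2}{n} E_{n-1}[G] + \frac{1}{n}. \tag{47}$$

Therefore, combining Equations (45) and (47) gives

$$\begin{aligned} \mathrm{Var}_n[G] &= E_n[G^2] - E_n^2[G] \\ &= E_{n-1}[G^2] + \frac{1}{n}\left(2E_{n-1}[G] + 1\right) - \left(E_{n-1}[G] + \frac{1}{n}\right)^2 \\ &= \mathrm{Var}_{n-1}[G] + \frac{1}{n} - \frac{1}{n^2}. \end{aligned} \tag{48}$$

Thus, we can use $\mathrm{Var}_2[G] = \frac{1}{4}$ to prove by induction that

$$\mathrm{Var}_n[G] = H_n - H_{n,2}, \tag{49}$$

where $H_n = \sum_{i=1}^{n} \frac{1}{i}$ and $H_{n,2} = \sum_{i=1}^{n} \frac{1}{i^2}$.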
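
The two closed forms above can be verified exactly for a given $n$. The Python sketch below iterates $g_n(j) = \frac{n-1}{n} g_{n-1}(j) + \frac{1}{n} g_{n-1}(j-1)$, which is the recursion implied by the difference equation of $Y_n(z)$, using rational arithmetic; the function names are chosen here for illustration.

```python
from fractions import Fraction


def generation_distribution(n):
    """Exact g_n(j), j = 0..n-1, iterated from g_1 = (1,) using
    g_k(j) = (k-1)/k * g_{k-1}(j) + 1/k * g_{k-1}(j-1)."""
    g = [Fraction(1)]
    for k in range(2, n + 1):
        prev = g + [Fraction(0)]
        g = [Fraction(k - 1, k) * prev[j]
             + (Fraction(1, k) * prev[j - 1] if j > 0 else 0)
             for j in range(k)]
    return g


def mean_and_variance(dist):
    """Mean and variance of a distribution on 0..len-1."""
    mean = sum(j * p for j, p in enumerate(dist))
    second = sum(j * j * p for j, p in enumerate(dist))
    return mean, second - mean ** 2


if __name__ == "__main__":
    n = 40
    H_n = sum(Fraction(1, i) for i in range(1, n + 1))
    H_n2 = sum(Fraction(1, i * i) for i in range(1, n + 1))
    mean, var = mean_and_variance(generation_distribution(n))
    print(mean == H_n - 1, var == H_n - H_n2)
```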

APPENDIX 4: Proof of Corollary 4 

We prove this corollary by applying the z-transform. If a
random variable $X$ follows a Poisson distribution with
parameter $\lambda$,

$$\Pr(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}, \quad k = 0, 1, 2, \cdots. \tag{50}$$

Using the z-transform, we have

$$X(z) = \sum_{k=0}^{\infty} \Pr(X = k) z^{-k} = e^{\lambda(z^{-1} - 1)}. \tag{51}$$

Meanwhile, using the recursion for $g_n(j)$ in Theorem 3, we find the
z-transform of $g_n(j)$

$$Y_n(z) = \sum_{j=0}^{\infty} g_n(j) z^{-j} = \left(1 + \frac{z^{-1} - 1}{n}\right) Y_{n-1}(z). \tag{52}$$

Note that when $x \to 0$, $e^x \approx 1 + x$. Thus, when $n$ is very
large, $1 + \frac{z^{-1} - 1}{n} \approx \exp((z^{-1} - 1)/n)$. That is,

$$Y_n(z) \approx e^{\frac{z^{-1} - 1}{n}} Y_{n-1}(z). \tag{53}$$

Using $Y_1(z) = 1$, we can recursively obtain

$$Y_n(z) \approx e^{(z^{-1} - 1) \sum_{i=2}^{n} \frac{1}{i}} = e^{(H_n - 1)(z^{-1} - 1)}. \tag{54}$$

Therefore, by comparing Equations (51) and (54), $g_n(j)$ can
be approximated by the Poisson distribution with parameter
$H_n - 1$ as in Equation (21).
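
The quality of the Poisson approximation can be gauged numerically by comparing $g_n(j)$, iterated from $g_n(j) = \frac{n-1}{n} g_{n-1}(j) + \frac{1}{n} g_{n-1}(j-1)$, with the Poisson probabilities for parameter $H_n - 1$. The Python sketch below reports the total variation distance between the two; the function names, the truncation at generation 60, and $n = 5000$ are assumptions made for illustration.

```python
import math


def generation_distribution(n, max_gen=60):
    """g_n(j) for j = 0..max_gen, iterated in place from g_1 = (1, 0, ...)."""
    g = [0.0] * (max_gen + 1)
    g[0] = 1.0
    for k in range(2, n + 1):
        # Update from high to low index so g[j - 1] is still the old value.
        for j in range(max_gen, 0, -1):
            g[j] = (k - 1) / k * g[j] + g[j - 1] / k
        g[0] = (k - 1) / k * g[0]
    return g


def poisson_pmf(lam, max_k):
    """Poisson probabilities for k = 0..max_k."""
    return [math.exp(-lam) * lam ** k / math.factorial(k)
            for k in range(max_k + 1)]


def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    n = 5000
    lam = sum(1.0 / i for i in range(1, n + 1)) - 1   # H_n - 1
    g = generation_distribution(n)
    print("total variation distance:", total_variation(g, poisson_pmf(lam, 60)))
```

A small but nonzero distance is expected, since the derivation replaces each factor $1 + (z^{-1} - 1)/n$ with an exponential and is therefore only approximate.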